Abstract


Calculation of the log-normalizer is a major computational obstacle in applications 
of log-linear models with large output spaces. The problem of fast normalizer 
computation has therefore attracted significant attention in the theoretical and 
applied machine learning literature. In this paper, we analyze a recently proposed 
technique known as “self-normalization”, which introduces a regularization term 
in training to penalize log normalizers for deviating from zero. This makes it 
possible to use unnormalized model scores as approximate probabilities. Empirical 
evidence suggests that self-normalization is extremely effective, but a theoretical 
understanding of why it should work, and how generally it can be applied, is largely 
lacking. 

We prove generalization bounds on the estimated variance of normalizers and 
upper bounds on the loss in accuracy due to self-normalization, describe classes of 
input distributions that self-normalize easily, and construct explicit examples of 
high-variance input distributions. Our theoretical results make predictions about 
the difficulty of fitting self-normalized models to several classes of distributions, 
and we conclude with empirical validation of these predictions. 


1 Introduction 

Log-linear models, a general class that includes conditional random fields (CRFs) and generalized
linear models (GLMs), offer a flexible yet tractable approach to modeling conditional probability
distributions $p(y \mid x)$ [?]. When the set of possible $y$ values is large, however, the computational
cost of computing a normalizing constant for each $x$ can be prohibitive, involving a summation with
many terms, a high-dimensional integral, or an expensive dynamic program.


The machine translation community has recently described several procedures for training
“self-normalized” log-linear models [?]. The goal of self-normalization is to choose model parameters
that simultaneously yield accurate predictions and produce normalizers clustered around unity. Model
scores can then be used as approximate surrogates for probabilities, obviating the need for
normalizer computation.

In particular, given a model of the form

$$p_\eta(y \mid x) = e^{\eta^\top T(x, y) - A(x, \eta)} \tag{1}$$

with

$$A(x, \eta) = \log \sum_{y \in \mathcal{Y}} e^{\eta^\top T(x, y)}, \tag{2}$$

we seek a setting of $\eta$ such that $A(x, \eta)$ is close enough to zero (with high probability under $p(x)$) to
be ignored.
This paper aims to understand the theoretical properties of self-normalization. Empirical results have
already demonstrated the efficacy of this approach—for discrete models with many output classes, it 
appears that normalizer values can be made nearly constant without sacrificing too much predictive 
accuracy, providing dramatic efficiency increases at minimal performance cost. 


The broad applicability of self-normalization makes it likely to spread to other large-scale applications
of log-linear models, including structured prediction (with combinatorially many output classes)
and regression (with continuous output spaces). But it is not obvious that we should expect such
approaches to be successful: the number of inputs (if finite) can be on the order of millions, the
geometry of the resulting input vectors $x$ highly complex, and the class of functions $A(x, \eta)$
associated with different inputs quite rich. To find a nontrivial parameter setting with $A(x, \eta)$
roughly constant seems challenging enough; to require that the corresponding $\eta$ also lead to good
classification results seems too much. And yet for many input distributions that arise in practice, it
appears possible to choose $\eta$ to make $A(x, \eta)$ nearly constant without having to sacrifice classification
accuracy.


Our goal is to bridge the gap between theoretical intuition and practical experience. Previous work
[?] bounds the sample complexity of self-normalizing training procedures for a restricted class of
models, but leaves open the question of how self-normalization interacts with the predictive power
of the learned model. This paper seeks to answer that question. We begin by generalizing the
previously studied model to a much more general class of distributions, including distributions with
continuous support (Section 3). Next, we provide what we believe to be the first characterization of
the interaction between self-normalization and model accuracy (Section 4). This characterization is
given from two perspectives:


• a bound on the “likelihood gap” between self-normalized and unconstrained models 

• a conditional distribution provably hard to represent with a self-normalized model 

In Section 5 we present empirical evidence that these bounds correctly characterize the difficulty of
self-normalization, and in the conclusion we survey a set of open problems that we believe merit 
further investigation. 


2 Problem background 

The immediate motivation for this work is a procedure proposed to speed up decoding in a machine 
translation system with a neural-network language model [?]. The language model used is a standard
feed-forward neural network, with a “softmax” output layer that turns the network’s predictions into 
a distribution over the vocabulary, where each probability is log-proportional to its output activation. 
It is observed that with a sufficiently large vocabulary, it becomes prohibitive to obtain probabilities 
from this model (which must be queried millions of times during decoding). To fix this, the language 
model is trained with the following objective: 

$$\max_W \sum_i \Big[ N(y_i \mid x_i; W) - \log \sum_{y'} e^{N(y' \mid x_i; W)} - \alpha \Big( \log \sum_{y'} e^{N(y' \mid x_i; W)} \Big)^2 \Big]$$

where $N(y \mid x; W)$ is the response of output $y$ in the neural net with weights $W$ given an input $x$. From
a Lagrangian perspective, the extra penalty term simply confines $W$ to the set of “empirically
normalizing” parameters, for which all log-normalizers are close (in squared error) to the origin. For
a suitable choice of $\alpha$, it is observed that the trained network is simultaneously accurate enough to
produce good translations, and close enough to self-normalized that the raw scores $N(y_i \mid x_i)$ can be
used in place of log-probabilities without substantial further degradation in quality.
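The penalized objective above can be illustrated for a single training example. The following is a minimal sketch, not the cited system's implementation: the function names, the three-word vocabulary and the value of alpha are assumptions chosen for illustration. It shows that scores which already sum to one incur no penalty, while uniformly shifted scores leave the log-likelihood term unchanged but pay alpha times the squared log-normalizer.

```python
import math


def log_sum_exp(scores):
    """Numerically stable log of the sum of exponentiated scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def self_normalized_objective(scores, gold, alpha):
    """Per-example objective: N(y|x) - log Z(x) - alpha * (log Z(x))^2."""
    log_z = log_sum_exp(scores)
    return scores[gold] - log_z - alpha * log_z ** 2


# Hypothetical raw scores over a three-word vocabulary (illustrative values only).
normalized_scores = [math.log(0.7), math.log(0.2), math.log(0.1)]  # log Z = 0
shifted_scores = [s + 2.0 for s in normalized_scores]              # log Z = 2

obj_normalized = self_normalized_objective(normalized_scores, gold=0, alpha=0.5)
obj_shifted = self_normalized_objective(shifted_scores, gold=0, alpha=0.5)
```

Both score vectors define the same normalized distribution, so only the penalty term separates the two objective values.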

We seek to understand the observed success of these models in finding accurate, normalizing parameter
settings. While it is possible to derive bounds of the kind we are interested in for general neural
networks [?], in this paper we work with a simpler linear parameterization that we believe captures
the interesting aspects of this problem.¹

¹ It is possible to view a log-linear model as a single-layer network with a softmax output. More usefully,
all of the results presented here apply directly to trained neural nets in which the last layer only is retrained to
self-normalize [?].
Related work


The approach described at the beginning of this section is closely related to an alternative
self-normalization trick based on noise-contrastive estimation (NCE) [?]. NCE is an alternative
to direct optimization of likelihood, instead training a classifier to distinguish between true samples
from the model and “noise” samples from some other distribution. The structure of the training
objective makes it possible to replace explicit computation of each log-normalizer with an estimate. In
traditional NCE, these values are treated as part of the parameter space, and estimated simultaneously
with the model parameters; there exist guarantees that the normalizer estimates will eventually
converge to their true values. It is instead possible to fix all of these estimates to one. In this case,
empirical evidence suggests that the resulting model will also exhibit self-normalizing behavior [4].

A host of other techniques exist for solving the computational problem posed by the log-normalizer.
Many of these involve approximating the associated sum or integral using quadrature [?], herding
[?], or Monte Carlo methods [?]. For the special case of discrete, finite output spaces, an alternative
approach (the hierarchical softmax) is to replace the large sum in the normalizer with a series
of binary decisions [?]. The output classes are arranged in a binary tree, and the probability of
generating a particular output is the product of probabilities along the edges leading to it. This reduces
the cost of computing the normalizer from $O(k)$ to $O(\log k)$. While this limits the set of distributions
that can be learned, and still requires greater-than-constant time to compute normalizers, it appears to
work well in practice. It cannot, however, be applied to problems with continuous output spaces.


3 Self-normalizable distributions 


We begin by providing a slightly more formal characterization of a general log-linear model: 

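Definition 1 (Log-linear models). Given a space of inputs $\mathcal{X}$, a space of outputs $\mathcal{Y}$, a measure $\mu$ on
$\mathcal{Y}$, a nonnegative function $h : \mathcal{Y} \to \mathbb{R}$, and a function $T : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ that is $\mu$-measurable with
respect to its second argument, we can define a log-linear model indexed by parameters $\eta \in \mathbb{R}^d$, with
the form

$$p_\eta(y \mid x) = h(y)\, e^{\eta^\top T(x, y) - A(x, \eta)}, \tag{3}$$

where

$$A(x, \eta) = \log \int_{\mathcal{Y}} h(y)\, e^{\eta^\top T(x, y)}\, d\mu(y). \tag{4}$$

If $A(x, \eta) < \infty$, then $\int_{\mathcal{Y}} p_\eta(y \mid x)\, d\mu(y) = 1$, and $p_\eta(y \mid x)$ is a probability density over $\mathcal{Y}$.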

We next formalize our notion of a self-normalized model. 

Definition 2 (Self-normalized models). The log-linear model $p_\eta(y \mid x)$ is self-normalized with
respect to a set $S \subset \mathcal{X}$ if for all $x \in S$, $A(x, \eta) = 0$. In this case we say that $S$ is self-normalizable,
and $\eta$ is self-normalizing w.r.t. $S$.

An example of a normalizable set is shown in Figure 1a, and we provide additional examples below:


² Some readers may be more familiar with generalized linear models, which also describe exponential family
distributions with a linear dependence on input. The presentation here is strictly more general, and has a few
notational advantages: it makes explicit the dependence of $A$ on $x$ and $\eta$ but not $y$, and lets us avoid tedious
bookkeeping involving natural and mean parameterizations.
(a) A self-normalizable set $S$ for fixed $\eta$: the solutions
$(x_1, x_2)$ to $A(x, \eta) = 0$ with $\eta^\top T(x, y) = \eta_y^\top x$ and
$\eta = \{(-1, 1), (-1, -2)\}$. The set forms a smooth
one-dimensional manifold bounded on either side by
hyperplanes normal to $(-1, 1)$ and $(-1, -2)$.

(b) Sets of approximately normalizing parameters $\eta$
for fixed $p(x)$: solutions $(\eta_1, \eta_2)$ to $E[A(x, \eta)^2] = \delta^2$
with $T(x, y) = (x + y, -xy)$, $y \in \{-1, 1\}$ and
$p(x)$ uniform on $\{1, 2\}$. For a given upper bound on
normalizer variance, the feasible set of parameters is
nonconvex, and grows as $\delta$ increases.

Figure 1: Self-normalizable data distributions and parameter sets.


Example. Suppose

$$S = \{\log 2, -\log 2\}, \quad \mathcal{Y} = \{-1, 1\}, \quad T(x, y) = [xy, 1], \quad \eta = (1, \log(2/5)).$$

Then for either $x \in S$,

$$A(x, \eta) = \log\big(e^{\log 2 + \log(2/5)} + e^{-\log 2 + \log(2/5)}\big) = \log\big((2/5)(2 + 1/2)\big) = 0,$$

and $\eta$ is self-normalizing with respect to $S$.
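The arithmetic in this example can be checked numerically. The sketch below assumes the same feature function $T(x, y) = [xy, 1]$ and parameters as the example; the function name and the off-set test point $x = 0$ are illustrative additions.

```python
import math


def log_normalizer(x, eta, outputs=(-1, 1)):
    """A(x, eta) = log sum_y exp(eta^T T(x, y)) with T(x, y) = [x*y, 1]."""
    return math.log(sum(math.exp(eta[0] * x * y + eta[1] * 1.0) for y in outputs))


eta = (1.0, math.log(2.0 / 5.0))
S = (math.log(2.0), -math.log(2.0))

# Log-normalizers on the self-normalizable set S: both are zero up to rounding.
values_on_S = [log_normalizer(x, eta) for x in S]

# An input outside S (hypothetical test point) is not normalized by the same eta.
value_off_S = log_normalizer(0.0, eta)
```

At $x = 0$ the two exponentials each equal $2/5$, so the log-normalizer is $\log(4/5)$ rather than zero.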

It is also easy to choose parameters that do not result in a self-normalized distribution, and in fact to 
construct a target distribution which cannot be self-normalized: 

Example. Suppose

$$\mathcal{X} = \{(1, 0), (0, 1), (1, 1)\}, \quad \mathcal{Y} = \{-1, 1\}, \quad T(x, y) = [x_1 y, x_2 y, 1].$$

Then there is no $\eta$ such that $A(x, \eta) = 0$ for all $x$, and $A(x, \eta)$ is constant if and only if $\eta = 0$.


As previously motivated, downstream uses of these models may be robust to small errors resulting 
from improper normalization, so it would be useful to generalize this definition of normalizable 
distributions to distributions that are only approximately normalizable. Exact normalizability of 
the conditional distribution is a deterministic statement—there either does or does not exist some 
$x$ that violates the constraint. In Figure 1a, for example, it suffices to have a single $x$ off of the
indicated surface to make a set non-normalizable. Approximate normalizability, by contrast, is
inherently a probabilistic statement, involving a distribution $p(x)$ over inputs. Note carefully that
we are attempting to represent $p(y \mid x)$ but have no representation of (or control over) $p(x)$, and that
approximate normalizability depends on $p(x)$ but not $p(y \mid x)$.

Informally, if some input violates the self-normalization constraint by a large margin, but occurs only 
very infrequently, there is no problem; instead we are concerned with expected deviation. It is also 
at this stage that the distinction between penalization of the normalizer vs. log-normalizer becomes 
important. The normalizer is necessarily bounded below by zero (so overestimates might appear 
much worse than underestimates), while the log-normalizer is unbounded in both directions. For 
most applications we are concerned with log probabilities and log-odds ratios, for which an expected 
normalizer close to zero is just as bad as one close to infinity. Thus the log-normalizer is the natural 
choice of quantity to penalize. 

Definition 3 (Approximately self-normalized models). The log-linear distribution $p_\eta(y \mid x)$ is
$\delta$-approximately normalized with respect to a distribution $p(x)$ over $\mathcal{X}$ if $E[A(X, \eta)^2] \le \delta^2$. In this case
we say that $p(x)$ is $\delta$-approximately self-normalizable, and $\eta$ is $\delta$-approximately self-normalizing.

The sets of $\delta$-approximately self-normalizing parameters for a fixed input distribution and feature
function are depicted in Figure 1b. Unlike self-normalizable sets of inputs, self-normalizing and
approximately self-normalizing sets of parameters may have complex geometry.
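Definition 3 can be evaluated directly for the small setting described in the caption of Figure 1b. The sketch below assumes that setting ($T(x, y) = (x + y, -xy)$, $y \in \{-1, 1\}$, $p(x)$ uniform on $\{1, 2\}$); the function names and the tested parameter and tolerance values are illustrative.

```python
import math


def log_normalizer(x, eta):
    """A(x, eta) for T(x, y) = (x + y, -x*y) and y in {-1, 1}."""
    return math.log(sum(math.exp(eta[0] * (x + y) - eta[1] * x * y) for y in (-1, 1)))


def expected_squared_log_normalizer(eta, inputs=(1, 2)):
    """E[A(X, eta)^2] with p(x) uniform on `inputs`."""
    return sum(log_normalizer(x, eta) ** 2 for x in inputs) / len(inputs)


def is_delta_approximately_normalizing(eta, delta):
    """Check the condition of Definition 3: E[A(X, eta)^2] <= delta^2."""
    return expected_squared_log_normalizer(eta) <= delta ** 2


# At eta = (0, 0) every log-normalizer equals log 2, so E[A^2] = (log 2)^2.
at_origin = expected_squared_log_normalizer((0.0, 0.0))
```

Scanning `is_delta_approximately_normalizing` over a grid of $(\eta_1, \eta_2)$ values for several choices of $\delta$ is one way to trace out feasible parameter sets of the kind the figure describes.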


Throughout this paper, we will assume that vectors of sufficient statistics $T(x, y)$ have
$\ell_2$ norm at most $R$, natural parameter vectors $\eta$ have $\ell_2$ norm at most $B$ (that is, they are
Ivanov-regularized), and that vectors of both kinds lie in $\mathbb{R}^d$. Finally, we assume that all input vectors have a
constant feature: in particular, that $x_0 = 1$ for every $x$ (with corresponding weight $\eta_0$).³


The first question we must answer is whether the problem of training self-normalized models is
feasible at all, that is, whether there exist any exactly self-normalizable data distributions $p(x)$,
or at least $\delta$-approximately self-normalizable distributions for small $\delta$. Section 3 already gave an
example of an exactly normalizable distribution. In fact, there are large classes of both exactly and
approximately normalizable distributions.


Observation. Given some fixed $\eta$, consider the set $S_\eta = \{x \in \mathcal{X} : A(x, \eta) = 0\}$. Any distribution
$p(x)$ supported on $S_\eta$ is normalizable. Additionally, every self-normalizable distribution is
characterized by at least one such $\eta$.


This definition provides a simple geometric characterization of self-normalizable distributions. An
example solution set is shown in Figure 1a. More generally, if $\mathcal{Y}$ is discrete and $T(x, y)$ consists of
$|\mathcal{Y}|$ repetitions of a fixed feature function $t(x)$ (as in Figure 1a), then we can write

$$A(x, \eta) = \log \sum_{y \in \mathcal{Y}} e^{\eta_y^\top t(x)}. \tag{5}$$

Provided $\eta_y^\top t(x)$ is convex in $x$ for each $\eta_y$, the level sets of $A$ as a function of $x$ form the boundaries
of convex sets. In particular, exactly normalizable sets are always the boundaries of convex regions,
as in the simple example of Figure 1a.

We do not, in general, expect real-world datasets to be supported on the precise class of self- 
normalizable surfaces. Nevertheless, it is very often observed that data of practical interest lie on 
other low-dimensional manifolds within their embedding feature spaces. Thus we can ask whether it 
is sufficient for a target distribution to be well-approximated by a self-normalizing one. We begin by 
constructing an appropriate measurement of the quality of this approximation. 

Definition 4 (Closeness). An input distribution $p(x)$ is $D$-close to a set $S$ if

$$E\Big[ \inf_{x^* \in S} \sup_{y \in \mathcal{Y}} \| T(X, y) - T(x^*, y) \|_2 \Big] \le D. \tag{6}$$

In other words, $p(x)$ is $D$-close to $S$ if a random sample from $p$ is no more than a distance $D$ from $S$
in expectation. Now we can relate the quality of this approximation to the level of self-normalization
achieved. Generalizing a result from [?], we have:

³ It will occasionally be instructive to consider the special case where $\mathcal{X}$ is the Boolean hypercube, and we
will explicitly note where this assumption is made. Otherwise all results apply to general distributions, both
continuous and discrete.
Proposition 1. Suppose $p(x)$ is $D$-close to $\{x : A(x, \eta) = 0\}$. Then $p(x)$ is $BD$-approximately
self-normalizable (recalling that $\|\eta\|_2 \le B$).

(Proofs for this section may be found in Appendix A.)

The intuition here is that data distributions that place most of their mass in feature space close to 
normalizable sets are approximately normalizable on the same scale. 

4 Normalization and model accuracy 

So far our discussion has concerned the problem of finding conditional distributions that self- 
normalize, without any concern for how well they actually perform at modeling the data. Here 
the relationship between the approximately self-normalized distribution and the true distribution 
$p(y \mid x)$ (which we have so far ignored) is essential. Indeed, if we are not concerned with making
a good model it is always trivial to make a normalized one: simply take $\eta = 0$ and then scale $\eta_0$
appropriately! We ultimately desire both good self-normalization and good data likelihood, and
in this section we characterize the tradeoff between maximizing data likelihood and satisfying a 
self-normalization constraint. 

We achieve this characterization by measuring the likelihood gap between the classical maximum
likelihood estimator and the MLE subject to a self-normalization constraint. Specifically, given pairs
$((x_1, y_1), \ldots, (x_n, y_n))$, let $\ell(\eta \mid x, y) = \sum_i \log p_\eta(y_i \mid x_i)$. Then define

$$\hat\eta = \arg\max_\eta\, \ell(\eta \mid x, y) \tag{7}$$

$$\hat\eta_\delta = \arg\max_{\eta \,:\, V(\eta) \le \delta} \ell(\eta \mid x, y) \tag{8}$$

(where $V(\eta) = \frac{1}{n} \sum_i A(x_i, \eta)^2$).

We would like to obtain a bound on the likelihood gap, which we define as the quantity

$$\Delta_\ell(\hat\eta, \hat\eta_\delta) = \frac{1}{n} \big( \ell(\hat\eta \mid x, y) - \ell(\hat\eta_\delta \mid x, y) \big). \tag{9}$$

We claim: 

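Theorem 2. Suppose $\mathcal{Y}$ has finite measure. Then asymptotically as $n \to \infty$,

$$\Delta_\ell(\hat\eta, \hat\eta_\delta) \le \Big(1 - \frac{\delta}{R \|\hat\eta\|_2}\Big)\, E\, \mathrm{KL}\big(p_{\hat\eta}(\cdot \mid X) \,\|\, \mathrm{Unif}\big). \tag{10}$$

(Proofs for this section may be found in Appendix B.)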

This result lower-bounds the likelihood at $\hat\eta_\delta$ by explicitly constructing a scaled version of $\hat\eta$ that
satisfies the self-normalization constraint. Specifically, if $\eta$ is chosen so that normalizers are penalized
for distance from $\log \mu(\mathcal{Y})$ (e.g. the logarithm of the number of classes in the finite case), then any
increase in $\eta$ along the span of the data is guaranteed to increase the penalty. From here it is possible
to choose an $\alpha \in (0, 1)$ such that $\alpha\hat\eta$ satisfies the constraint. The likelihood at $\alpha\hat\eta$ is necessarily no greater
than $\ell(\hat\eta_\delta \mid x, y)$, and can be used to obtain the desired lower bound.

Thus at one extreme, distributions close to uniform can be self-normalized with little loss of likelihood.
What about the other extreme, distributions “as far from uniform as possible”? With suitable
assumptions about the form of $p_{\hat\eta}(y \mid x)$, we can use the same construction of a self-normalizing
parameter to achieve an alternative characterization for distributions that are close to deterministic:

Proposition 3. Suppose that $\mathcal{X}$ is a subset of the Boolean hypercube, $\mathcal{Y}$ is finite, and $T(x, y)$ is the
conjunction of each element of $x$ with an indicator on the output class. Suppose additionally that for
every input $x$, $p_{\hat\eta}(y \mid x)$ makes a unique best prediction; that is, for each $x \in \mathcal{X}$, there exists a unique
$y^* \in \mathcal{Y}$ such that whenever $y \ne y^*$, $\eta^\top T(x, y^*) > \eta^\top T(x, y)$. Then

$$\Delta_\ell(\hat\eta, \hat\eta_\delta) \le b\, \|\hat\eta\|_2^2\, e^{-c\delta/R} \tag{11}$$

for distribution-dependent constants $b$ and $c$.
This result is obtained by representing the constrained likelihood with a second-order Taylor expansion
about the true MLE. All terms in the likelihood gap vanish except for the remainder; this can be
upper-bounded by $\|\hat\eta_\delta\|_2^2$ times the largest eigenvalue of the feature covariance matrix at $\hat\eta_\delta$, which
in turn is bounded by $e^{-c\delta/R}$.


The favorable rate we obtain for this case indicates that “all-nonuniform” distributions are also an
easy class for self-normalization. Together with Theorem 2 this suggests that hard distributions must
have some mixture of uniform and nonuniform predictions for different inputs. This is supported by
the results in Section 5.


The next question is whether there is a corresponding lower bound; that is, whether there exist 
any conditional distributions for which all nearby distributions are provably hard to self-normalize. 
The existence of a direct analog of Theorem 2 remains an open problem, but we make progress by 
developing a general framework for analyzing normalizer variance. 


One key issue is that while likelihoods are invariant to certain changes in the natural parameters, the
log normalizers (and therefore their variance) are far from invariant. We therefore focus on equivalence
classes of natural parameters, as defined below. Throughout, we will assume a fixed distribution p(x) 
on the inputs x. 

Definition 5 (Equivalence of parameterizations). Two natural parameter values $\eta$ and $\eta'$ are said
to be equivalent (with respect to an input distribution $p(x)$), denoted $\eta \sim \eta'$, if

$$p_\eta(y \mid x) = p_{\eta'}(y \mid x) \quad \text{a.s. } p(x).$$


We can then define the optimal log normalizer variance for the distribution associated with a natural 
parameter value. 

Definition 6 (Optimal variance). We define the optimal log normalizer variance of the log-linear
model associated with a natural parameter value $\eta$ by

$$V^*(\eta) = \inf_{\eta' \sim \eta} \mathrm{Var}_{p(x)}\big[A(X, \eta')\big].$$


We now specialize to the case where $\mathcal{Y}$ is finite with $|\mathcal{Y}| = K$ and where $T : \mathcal{Y} \times \mathcal{X} \to \mathbb{R}^{K \times d}$ satisfies

$$T(k, x)_{k'j} = \delta_{kk'}\, x_j.$$

This is an important special case that arises, for example, in multi-way logistic regression. In this
setting, we can show that despite the fundamental non-identifiability of the model, the variance can
be provably high under every parameterization of the distribution.
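The non-identifiability referred to here can be seen concretely in the multi-way logistic regression case: adding the same vector to every class's weights leaves all conditional probabilities unchanged but shifts the log-normalizer by a term linear in $x$. The sketch below is illustrative only; the weight values, the shift vector `beta` and the input `x` are assumptions.

```python
import math


def class_scores(eta, x):
    """eta is a K x d list of per-class weight vectors; returns eta_k^T x for each k."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in eta]


def log_normalizer(eta, x):
    s = class_scores(eta, x)
    m = max(s)
    return m + math.log(sum(math.exp(v - m) for v in s))


def probabilities(eta, x):
    a = log_normalizer(eta, x)
    return [math.exp(v - a) for v in class_scores(eta, x)]


eta = [[1.0, -0.5], [0.0, 2.0], [-1.0, 0.3]]   # hypothetical K = 3, d = 2 parameters
beta = [0.7, -1.2]                              # same shift applied to every class
eta_shifted = [[w + b for w, b in zip(row, beta)] for row in eta]
x = [1.0, 1.0]                                  # a point of the Boolean hypercube

probs, probs_shifted = probabilities(eta, x), probabilities(eta_shifted, x)
shift = log_normalizer(eta_shifted, x) - log_normalizer(eta, x)   # equals beta^T x
```

Because equivalent parameters can have very different log-normalizers, the variance has to be minimized over the whole equivalence class, which is what the optimal variance $V^*$ does.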

Theorem 4. Let $\mathcal{X} = \{0, 1\}^d$ and let the input distribution $p(x)$ be uniform on $\mathcal{X}$. There exists an
$\eta^0 \in \mathbb{R}^{K \times d}$ such that for $\eta = \alpha\eta^0$, $\alpha > 0$,

$$V^*(\eta) \ge \frac{\|\eta\|_2^2}{32\, d(d-1)} - \ldots$$


5 Experiments 


The high-level intuition behind the results in the preceding section can be summarized as follows: 1)
for predictive distributions that are in expectation high-entropy or low-entropy, self-normalization 
results in a relatively small likelihood gap; 2) for mixtures of high- and low-entropy distributions, 
self-normalization may result in a large likelihood gap. More generally, we expect that an increased 
tolerance for normalizer variance will be associated with a decreased likelihood gap. 


In this section we provide experimental confirmation of these predictions. We begin by generating a
set of random sparse feature vectors, and an initial weight vector $\eta_0$. In order to produce a sequence
of label distributions that smoothly interpolate between low-entropy and high-entropy, we introduce
a temperature parameter $\tau$, and for various settings of $\tau$ draw labels from $p_{\tau\eta}$. We then fit a
self-normalized model to these training pairs. In addition to the synthetic data, we compare our results to
empirical data [?] from a self-normalized language model.
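The role of the temperature parameter can be illustrated with a small sketch. It assumes that scaling the weights by $\tau$ simply multiplies every score $\eta^\top T(x, y)$ by $\tau$ before the softmax; the score values and the temperature grid are hypothetical, and this is not the experimental code used for the figures.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


base_scores = [2.0, 0.5, -1.0, 0.0]         # hypothetical eta^T T(x, y) for four labels
temperatures = [0.0, 0.5, 1.0, 2.0, 8.0]    # tau = 0 gives the uniform distribution

entropies = [entropy(softmax([tau * s for s in base_scores])) for tau in temperatures]
```

Small values of $\tau$ give near-uniform (high-entropy) label distributions and large values give near-deterministic (low-entropy) ones, which is the interpolation the experiments rely on.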


Figure 2a plots the tradeoff between the likelihood gap and the error in the normalizer, under various
distributions (characterized by their KL from uniform). Here the tradeoff between self-normalization
and model accuracy can be seen: as the normalization constraint is relaxed, the likelihood gap
decreases.

Figure 2b shows how the likelihood gap varies as a function of the quantity $E\, \mathrm{KL}(p_\eta(\cdot \mid X) \,\|\, \mathrm{Unif})$.
As predicted, it can be seen that both extremes of this quantity result in small likelihood gaps, while
intermediate values result in large likelihood gaps.

(a) Normalization / likelihood tradeoff. As the normalization constraint $\delta$ is relaxed, the likelihood gap $\Delta_\ell$
decreases. Lines marked “KL=” are from synthetic data; the line marked “LM” is from [?].

(b) Likelihood gap as a function of expected divergence from the uniform distribution,
$E\, \mathrm{KL}(p_\eta \,\|\, \mathrm{Unif})$. As predicted by theory, the likelihood gap increases, then decreases,
as predictive distributions become more peaked.

Figure 2: Experimental results.


6 Conclusions 


Motivated by the empirical success of self-normalizing parameter estimation procedures for log-linear
models, we have attempted to establish a theoretical basis for the understanding of such procedures.
We have characterized both self-normalizable distributions, by constructing provably easy examples,
and self-normalizing training procedures, by bounding the loss of likelihood associated with
self-normalization.

While we have addressed many of the important first-line theoretical questions around
self-normalization, this study of the problem is by no means complete. We hope this family of problems
will attract further study in the larger machine learning community; toward that end, we provide the
following list of open questions:


1. How else can the approximately self-normalizable distributions be characterized? The
class of approximately normalizable distributions we have described is unlikely to correspond
perfectly to real-world data. We expect that Proposition 1 can be generalized to other
parametric classes, and relaxed to accommodate spectral or sparsity conditions.

2. Are the upper bounds in Theorem 2 or Proposition 3 tight? Our constructions involve
relating the normalization constraint to the $\ell_2$ norm of $\eta$, but in general some parameters
can have very large norm and still give rise to almost-normalized distributions.

3. Do corresponding lower bounds exist? While it is easy to construct exactly
self-normalizable distributions (which suffer no loss of likelihood), we have empirical evidence
that hard distributions also exist. It would be useful to lower-bound the loss of likelihood in
terms of some simple property of the target distribution.

4. Is the hard distribution in Theorem 4 stable? This is related to the previous question.
The existence of high-variance distributions is less worrisome if such distributions are
comparatively rare. If the variance lower bound falls off quickly as the given construction
is perturbed, then the associated distribution may still be approximately self-normalizable
with a good rate.


We have already seen that new theoretical insights in this domain can translate directly into practical 
applications. Thus, in addition to their inherent theoretical interest, answers to each of these questions 
might be applied directly to the training of approximately self-normalized models in practice. We
expect that self-normalization will find increasingly many applications, and we hope the results 
in this paper provide a first step toward a complete theoretical and empirical understanding of 
self-normalization in log-linear models. 
A Normalizable distributions

Proof of Proposition 1 (distributions close to normalizable sets are approximately normalizable).
Let $T(X, y) = T^*(X, y) + T^-(X, y)$, where $T^*(X, y) = \arg\min_{T(x^*, y) \,:\, A(x^*, \eta) = 0} \|T(X, y) - T(x^*, y)\|_2$. Then

$$E\Big(\log \int e^{\eta^\top T(X, y)}\, dy\Big)^2 = E\Big(\log \int e^{\eta^\top (T^*(X, y) + T^-(X, y))}\, dy\Big)^2$$

$$\le E\Big(\log \Big(e^{\eta^\top \tilde T} \int e^{\eta^\top T^*(X, y)}\, dy\Big)\Big)^2$$

for $\tilde T = \arg\max_{T^-(X, y)} \|\eta^\top T^-(X, y)\|_2$,

$$\le E\Big(\log \big(e^{\eta^\top \tilde T}\big)\Big)^2$$

$$\le (DB)^2. \qquad \square$$


B Normalization and likelihood 

B.1 General bound

Lemma 5. If $\|\eta\|_2 \le \delta/R$, then $p_\eta(y \mid x)$ is $\delta$-approximately normalized about $\log \mu(\mathcal{Y})$.

Proof. If $\log \int e^{\eta^\top T(X, y)}\, d\mu(y) > \log \mu(\mathcal{Y})$,

$$\Big(\log \int e^{\eta^\top T(X, y)}\, d\mu(y) - \log \mu(\mathcal{Y})\Big)^2 \le \Big(\log \int e^{\|\eta\|_2 R}\, d\mu(y) - \log \mu(\mathcal{Y})\Big)^2 = \|\eta\|_2^2 R^2 \le \delta^2.$$

The case where $\log \int e^{\eta^\top T(X, y)}\, d\mu(y) \le \log \mu(\mathcal{Y})$ is analogous, instead replacing $\eta^\top T(x, y)$ with
$-\|\eta\|_2 R$. The variance result follows from the fact that every log-partition is within $\delta$ of the
mean. □
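Lemma 5 can be checked numerically in the finite case with counting measure, where $\log \mu(\mathcal{Y}) = \log K$. The sketch below draws random feature vectors with norm at most $R$ and a parameter vector with norm exactly $\delta/R$; all constants, names and the sampling scheme are illustrative assumptions.

```python
import math
import random


def rescale(vector, norm):
    """Return `vector` rescaled to have the given l2 norm."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v * norm / length for v in vector]


def log_normalizer(eta, features):
    """A = log sum_y exp(eta^T T(x, y)) under the counting measure."""
    scores = [sum(e * t for e, t in zip(eta, feat)) for feat in features]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


random.seed(0)
R, delta, K, d = 2.0, 0.3, 5, 4          # illustrative constants
eta = rescale([random.gauss(0, 1) for _ in range(d)], delta / R)

worst_deviation = 0.0
for _ in range(200):
    # One hypothetical input x: K feature vectors T(x, y), each with norm at most R.
    features = [rescale([random.gauss(0, 1) for _ in range(d)], R * random.random())
                for _ in range(K)]
    worst_deviation = max(worst_deviation,
                          abs(log_normalizer(eta, features) - math.log(K)))
```

Every score satisfies $|\eta^\top T(x, y)| \le \|\eta\|_2 R \le \delta$ by Cauchy-Schwarz, so no sampled log-normalizer can be further than $\delta$ from $\log K$.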


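Proof of Theorem 2 (loss of likelihood is bounded in terms of distance from uniform). Consider the
likelihood evaluated at $\alpha\hat\eta$, where $\alpha = \delta / (R \|\hat\eta\|_2)$. We know that $0 < \alpha < 1$ (if $\delta > R \|\hat\eta\|_2$, then
the MLE already satisfies the normalizing constraint). Additionally, $p_{\alpha\hat\eta}(y \mid x)$ is $\delta$-approximately
normalized. (Both follow from Lemma 5.)

Then, since the likelihood at $\alpha\hat\eta$ is no greater than the likelihood at $\hat\eta_\delta$,

$$\Delta_\ell \le \frac{1}{n} \sum_i \big[ (\hat\eta^\top T(x_i, y_i) - A(x_i, \hat\eta)) - (\alpha\hat\eta^\top T(x_i, y_i) - A(x_i, \alpha\hat\eta)) \big]$$

$$= \frac{1}{n} \sum_i \big[ (1 - \alpha)\, \hat\eta^\top T(x_i, y_i) - A(x_i, \hat\eta) + A(x_i, \alpha\hat\eta) \big].$$

Because $A(x, \alpha\eta)$ is convex in $\alpha$,

$$A(x_i, \alpha\hat\eta) \le (1 - \alpha) A(x_i, 0) + \alpha A(x_i, \hat\eta) = (1 - \alpha) \log \mu(\mathcal{Y}) + \alpha A(x_i, \hat\eta).$$

Thus,

$$\Delta_\ell \le \frac{1}{n} \sum_i \big[ (1 - \alpha)\, \hat\eta^\top T(x_i, y_i) - A(x_i, \hat\eta) + (1 - \alpha) \log \mu(\mathcal{Y}) + \alpha A(x_i, \hat\eta) \big]$$

$$= (1 - \alpha) \frac{1}{n} \sum_i \big[ \hat\eta^\top T(x_i, y_i) - A(x_i, \hat\eta) + \log \mu(\mathcal{Y}) \big]$$

$$= (1 - \alpha) \frac{1}{n} \sum_i \big[ \log p_{\hat\eta}(y_i \mid x_i) - \log \mathrm{Unif}(y_i) \big]$$

$$\approx (1 - \alpha)\, E\, \mathrm{KL}\big(p_{\hat\eta}(\cdot \mid X) \,\|\, \mathrm{Unif}\big)$$

$$= \Big(1 - \frac{\delta}{R \|\hat\eta\|_2}\Big)\, E\, \mathrm{KL}\big(p_{\hat\eta}(\cdot \mid X) \,\|\, \mathrm{Unif}\big).$$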
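The convexity step in this proof holds for any parameter vector, not only the MLE, so the finite-sample inequality can be checked on arbitrary data. The sketch below uses a randomly drawn multi-class parameter matrix and random Boolean inputs with counting measure on $K$ labels; all values are hypothetical and the check covers only the empirical inequality, not the asymptotic KL form.

```python
import math
import random


def log_prob(eta, x, y):
    """log p_eta(y | x) for a multi-class log-linear model with per-class weights."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in eta]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[y] - log_z


random.seed(1)
K, d, n, alpha = 4, 3, 50, 0.4           # illustrative sizes and scaling factor
eta = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
data = [([random.choice((0.0, 1.0)) for _ in range(d)], random.randrange(K))
        for _ in range(n)]
eta_scaled = [[alpha * w for w in row] for row in eta]

# Average log-likelihood lost by shrinking eta to alpha * eta.
gap = sum(log_prob(eta, x, y) - log_prob(eta_scaled, x, y) for x, y in data) / n

# (1 - alpha) * average of [log p_eta(y_i | x_i) - log Unif(y_i)], with Unif = 1/K.
bound = (1 - alpha) * sum(log_prob(eta, x, y) + math.log(K) for x, y in data) / n
```

The comparison of `gap` and `bound` mirrors the chain of inequalities above with $\log \mu(\mathcal{Y}) = \log K$.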

B.2 All-nonuniform bound

We make the following assumptions:

• Labels $y$ are discrete. That is, $\mathcal{Y} = \{1, 2, \ldots, k\}$ for some $k$.

• $x \in \mathcal{H}(q)$. That is, each $x$ is a $\{0, 1\}$ indicator vector drawn from the Boolean hypercube in
$q$ dimensions.

• Joint feature vectors $T(x, y)$ are just the features of $x$ conjoined with the label $y$. Then
it is possible to think of $\eta$ as a sequence of vectors, one per class, and we can write
$\eta^\top T(x, y) = \eta_y^\top x$.

• As in the body text, let all MLE predictions be nonuniform, and in particular let each
$\hat\eta_{y^*}^\top x - \hat\eta_y^\top x > c \|\hat\eta\|$ for $y \ne y^*$.

Lemma 6. For a fixed $x$, the maximum covariance between any two features $x_i$ and $x_j$ under the
model evaluated at some $\eta$ in the direction of the MLE satisfies

$$\mathrm{Cov}[T(X, Y)_i, T(X, Y)_j \mid X = x] \le 2(k - 1)\, e^{-c\delta}. \tag{12}$$

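Proof. If either $i$ or $j$ is not associated with the class $y$, or is associated with a zero element of $x$, then
the associated feature (and thus the covariance at $(i, j)$) is identically zero. Thus we assume that $i$
and $j$ are both associated with $y$ and correspond to nonzero elements of $x$.

$$\mathrm{Cov}[T_i, T_j \mid X = x] = \sum_y p_\eta(y \mid x) - p_\eta(y \mid x)^2$$

Suppose $y$ is the majority class. Then,

$$p_\eta(y \mid x) - p_\eta(y \mid x)^2 = \frac{e^{\eta_y^\top x}}{\sum_{y'} e^{\eta_{y'}^\top x}} - \frac{e^{2\eta_y^\top x}}{\big(\sum_{y'} e^{\eta_{y'}^\top x}\big)^2}$$

$$= \frac{e^{\eta_y^\top x} \big(\sum_{y'} e^{\eta_{y'}^\top x}\big) - e^{2\eta_y^\top x}}{\big(\sum_{y'} e^{\eta_{y'}^\top x}\big)^2}$$

$$\le \frac{e^{\eta_y^\top x} \big(\sum_{y'} e^{\eta_{y'}^\top x}\big) - e^{2\eta_y^\top x}}{e^{2\eta_y^\top x}}$$

$$= \sum_{y' \ne y} e^{(\eta_{y'} - \eta_y)^\top x}$$

$$\le (k - 1)\, e^{-c\|\eta\|}.$$

Now suppose $y$ is not in the majority class. Then,

$$p_\eta(y \mid x) - p_\eta(y \mid x)^2 \le p_\eta(y \mid x) = \frac{e^{\eta_y^\top x}}{\sum_{y'} e^{\eta_{y'}^\top x}} \le e^{-c\|\eta\|}.$$

Thus the covariance

$$\sum_y p_\eta(y \mid x) - p_\eta(y \mid x)^2 \le 2(k - 1)\, e^{-c\|\eta\|}. \qquad \square$$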

Lemma 7. Suppose $\tilde\eta = \beta\hat\eta$ for some $\beta < 1$. Then for a sequence of observations $(x_1, \ldots, x_n)$,
under the model evaluated at $\tilde\eta$, the largest eigenvalue of the feature covariance matrix

$$\frac{1}{n} \sum_i \big[ E_{\tilde\eta}[T T^\top \mid X = x_i] - (E_{\tilde\eta}[T \mid X = x_i])(E_{\tilde\eta}[T \mid X = x_i])^\top \big] \tag{13}$$

is at most

$$q(k - 1)\, e^{-c\beta\|\hat\eta\|}. \tag{14}$$

Proof. From Lemma 6, each entry in the covariance matrix is at most $(k - 1)\, e^{-c\|\tilde\eta\|} = (k - 1)\, e^{-c\beta\|\hat\eta\|}$.
At most $q$ features are active in any row of the matrix. Thus by Gershgorin’s
theorem, the maximum eigenvalue of each term in Equation 13 is at most $q(k - 1)\, e^{-c\beta\|\hat\eta\|}$, which is also an
upper bound on the sum. □


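Proof of Proposition 3 (loss of likelihood goes as $e^{-\delta}$). As before, let us choose $\hat\eta_\delta = \alpha\hat\eta$, with
$\alpha = \delta / (R \|\hat\eta\|_2)$. We have already seen that this choice of parameter is normalizing.

Taking a second-order Taylor expansion about $\hat\eta$, we have

$$\log p_{\hat\eta_\delta}(y \mid x) = \log p_{\hat\eta}(y \mid x) + (\hat\eta_\delta - \hat\eta)^\top \nabla \log p_{\hat\eta}(y \mid x) + (\hat\eta_\delta - \hat\eta)^\top \nabla\nabla^\top \log p_{\tilde\eta}(y \mid x)\, (\hat\eta_\delta - \hat\eta)$$

$$= \log p_{\hat\eta}(y \mid x) + (\hat\eta_\delta - \hat\eta)^\top \nabla\nabla^\top \log p_{\tilde\eta}(y \mid x)\, (\hat\eta_\delta - \hat\eta)$$

where the first-order term vanishes because $\hat\eta$ is the MLE. It is a standard result for exponential
families that the Hessian in the second-order term is just Equation 13. Thus we can write

$$\log p_{\hat\eta_\delta}(y \mid x) \ge \log p_{\hat\eta}(y \mid x) - \|\hat\eta_\delta - \hat\eta\|^2\, q(k - 1)\, e^{-c\beta\|\hat\eta\|}$$

$$\ge \log p_{\hat\eta}(y \mid x) - (1 - \alpha)^2 \|\hat\eta\|^2\, q(k - 1)\, e^{-c\alpha\|\hat\eta\|}$$

$$= \log p_{\hat\eta}(y \mid x) - (\|\hat\eta\| - \delta/R)^2\, q(k - 1)\, e^{-c\delta/R}.$$

The proposition follows. □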


C Variance lower bound 

Let

\[ U_0 = \{\beta \in \mathbb{R}^{Kd} : \exists \tilde\beta \in \mathbb{R}^d,\ \beta_{kj} = \tilde\beta_j,\ 1 \le k \le K,\ 1 \le j \le d\}. \]

Lemma 8. If $\mathrm{span}(\mathcal{X}) = \mathbb{R}^d$, then equivalence of natural parameters is characterized by

\[ \eta \sim \eta' \iff \eta - \eta' \in U_0. \]

Proof. For $x \in \mathcal{X}$, denote by $P_\eta(x) \in \Delta_K$ the distribution over $y$. Now, suppose that $\eta \sim \eta'$ and fix $x \in \mathcal{X}$. By the definition of equivalence, we have

\[ \frac{P_\eta(x)_k}{P_\eta(x)_{k'}} = \frac{P_{\eta'}(x)_k}{P_{\eta'}(x)_{k'}}, \]

which immediately implies

\[ (\eta_k - \eta_{k'})^\top x = (\eta'_k - \eta'_{k'})^\top x, \]

whence

\[ [(\eta_k - \eta'_k) - (\eta_{k'} - \eta'_{k'})]^\top x = 0. \]

Since this holds for all $x \in \mathcal{X}$ and $\mathrm{span}(\mathcal{X}) = \mathbb{R}^d$, we get

\[ \eta_k - \eta'_k = \eta_{k'} - \eta'_{k'}. \]

That is, if we define

\[ \beta_j = \eta_{1j} - \eta'_{1j}, \]

we get

\[ \eta_{kj} - \eta'_{kj} = \eta_{1j} - \eta'_{1j} = \beta_j, \]

and $\eta - \eta' \in U_0$, as required.

Conversely, if $\eta - \eta' \in U_0$, choose an appropriate $\beta$. We then get

\[ \eta_k^\top x = (\eta'_k)^\top x + \beta^\top x. \]

It follows that

\[ A(\eta, x) = A(\eta', x) + \beta^\top x, \]

so that

\[ \eta^\top T(k, x) - A(\eta, x) = (\eta'_k)^\top x + \beta^\top x - A(\eta', x) - \beta^\top x = (\eta'_k)^\top x - A(\eta', x) \]

and the claim follows.


□ 
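The converse direction of Lemma 8 is easy to see numerically. The sketch below uses hypothetical weights and a hypothetical input (none are from the paper): adding the same vector $\beta$ to every class's weight vector shifts the log-normalizer by $\beta^\top x$ and leaves the conditional distribution unchanged.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_normalizer(eta, x):
    """A(eta, x) = log sum_k exp(eta_k^T x), computed stably."""
    scores = [dot(w, x) for w in eta]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def class_probs(eta, x):
    """p(k | x) = exp(eta_k^T x - A(eta, x))."""
    a = log_normalizer(eta, x)
    return [math.exp(dot(w, x) - a) for w in eta]

def shift_all_classes(eta, beta):
    """Add the same vector beta to every class's weight vector."""
    return [[w_j + b_j for w_j, b_j in zip(w, beta)] for w in eta]

# Hypothetical parameters with K = 3 classes and d = 2 features.
eta_prime = [[0.5, -1.0], [0.0, 2.0], [-0.3, 0.7]]
beta = [1.5, -0.25]
eta = shift_all_classes(eta_prime, beta)
x = [1.0, 2.0]

shift_in_A = log_normalizer(eta, x) - log_normalizer(eta_prime, x)
```

In this example `shift_in_A` matches `dot(beta, x)` and `class_probs(eta, x)` matches `class_probs(eta_prime, x)` up to rounding, so the two parameter settings are equivalent but have different normalizers.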


The key tool we use to prove the theorem reinterprets $V^*(\eta)$ as the norm of an orthogonal projection. We believe this may be of independent interest. To set it up, let $S = L^2(Q, \mathbb{R}^D)$ be the Hilbert space of square-integrable functions with respect to the input distribution $p(x)$, define

\[ w_j(x) = x_j - \mathbb{E}_{p(x)}[X_j] \]

and

\[ C = \mathrm{span}\{w_j\}_{1 \le j \le d}. \]

We then have

Lemma 9. Let $\bar A(\eta, x) = A(\eta, x) - \mathbb{E}_{p(x)}[A(\eta, X)]$. Then

\[ V^*(\eta) = \|\bar A(\eta, \cdot) - \Pi_C \bar A(\eta, \cdot)\|_2^2. \]

The second key observation, which we again believe is of independent interest, is that under certain circumstances, we can completely replace the normalizer $A(\eta, \cdot)$ by $\max_{y \in \mathcal{Y}} \eta^\top T(y, x)$. For this, we define

\[ E_\infty(\eta)(x) = \max_k \eta^\top T(k, x) = \max_k \eta_k^\top x \]

and correspondingly let $\bar E_\infty(\eta) = \mathbb{E}_{p(x)}[E_\infty(\eta)(X)]$.

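Proof. By Lemma 8 we have

\[ V^*(\eta) = \inf_{\beta \in \mathbb{R}^d} \int_{\mathbb{R}^d} \left[ A(\eta, x) - \bar A(\eta) - (\beta^\top x - \beta^\top \mathbb{E}_{p(x)}[X]) \right]^2 dp(x). \]

But now, we observe that this can be rewritten with the aid of the isomorphism $\mathbb{R}^d \simeq C$ defined by the identity

\[ \beta^\top x - \beta^\top \mathbb{E}_{p(x)}[X] = \sum_j \beta_j w_j(x) \]

to read

\[ V^*(\eta) = \inf_{f \in C} \int_{\mathbb{R}^d} \left[ A(\eta, x) - \bar A(\eta) - f(x) \right]^2 dp(x) = \|\bar A(\eta, \cdot) - \Pi_C \bar A(\eta, \cdot)\|_2^2, \]

as required.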


□ 
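The projection view in Lemma 9 can be sketched in the simplest case $d = 1$, where $C$ is spanned by the single centered coordinate $w_1$. The example below is illustrative only: the input distribution (uniform on five points), the two-class normalizer $A(x) = \log(1 + e^x)$, and all function names are assumptions made here, not taken from the paper.

```python
import math

def mean(values, weights):
    return sum(w * v for v, w in zip(values, weights))

def centered(values, weights):
    m = mean(values, weights)
    return [v - m for v in values]

def inner(f, g, weights):
    """L2 inner product with respect to the discrete input distribution."""
    return sum(w * a * b for a, b, w in zip(f, g, weights))

def projection_residual(a_vals, x_vals, weights):
    """Squared norm of the centered normalizer minus its projection onto w_1."""
    a_bar = centered(a_vals, weights)
    w1 = centered(x_vals, weights)
    coef = inner(a_bar, w1, weights) / inner(w1, w1, weights)
    resid = [a - coef * w for a, w in zip(a_bar, w1)]
    return inner(resid, resid, weights), coef

def variance_after_shift(a_vals, x_vals, weights, beta):
    """Var[A(X) - beta * X], the quantity minimized over beta."""
    shifted = [a - beta * x for a, x in zip(a_vals, x_vals)]
    c = centered(shifted, weights)
    return inner(c, c, weights)

# Hypothetical one-dimensional inputs, uniform weights.
x_vals = [0.0, 1.0, 2.0, 3.0, 4.0]
weights = [0.2] * 5
a_vals = [math.log(1.0 + math.exp(x)) for x in x_vals]

best, coef = projection_residual(a_vals, x_vals, weights)
```

Evaluating `variance_after_shift` on any grid of `beta` values gives numbers no smaller than `best`, with equality at `beta = coef`, which is the content of the lemma in this toy case.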
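Lemma 10. Suppose for each $x \in \mathcal{X}$, there is a unique $k^* = k^*(x)$ such that $k^*(x) = \arg\max_k \eta_k^\top x$ and such that for $k \ne k^*$, $\eta_k^\top x \le \eta_{k^*}^\top x - \Delta$ for some $\Delta > 0$. Then

\[ \sup_{x \in \mathcal{X}} \left| A(\alpha\eta, x) - \bar A(\alpha\eta) - [E_\infty(\alpha\eta)(x) - \bar E_\infty(\alpha\eta)] \right| \le Ke^{-\Delta\alpha}. \]

Proof. Denote by $\bar E_\infty(\eta)(x) = E_\infty(\eta)(x) - \bar E_\infty(\eta)$ the centered version of $E_\infty(\eta)(x)$. Using the inequality $1 + t \le e^t$, we immediately see that

\[ E_\infty(\alpha\eta)(x) \le A(\alpha\eta, x) = \alpha E_\infty(\eta)(x) + \log\Big(1 + \sum_{k \ne k^*(x)} e^{\alpha[\eta_k^\top x - E_\infty(\eta)(x)]}\Big) \le E_\infty(\alpha\eta)(x) + Ke^{-\Delta\alpha}. \]

It follows that

\[ \mathbb{E}_{p(x)}[E_\infty(\alpha\eta)(X)] \le \mathbb{E}_{p(x)}[A(\alpha\eta, X)] \le \mathbb{E}_{p(x)}[E_\infty(\alpha\eta)(X)] + Ke^{-\Delta\alpha}. \]

We thus have

\[ -Ke^{-\Delta\alpha} \le \bar A(\alpha\eta, x) - \bar E_\infty(\alpha\eta)(x) \le Ke^{-\Delta\alpha}, \quad x \in \mathcal{X}. \]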

The claim follows. □ 
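The pointwise sandwich in this proof is the familiar gap between log-sum-exp and max. A minimal sketch, with hypothetical scores $\eta_k^\top x$ for $K = 3$ classes and helper names introduced only for illustration:

```python
import math

def normalizer_gap(scores, alpha):
    """A(alpha*eta, x) - E_inf(alpha*eta)(x), i.e. log-sum-exp minus max."""
    scaled = [alpha * s for s in scores]
    m = max(scaled)
    # Subtracting the max first gives the gap directly and avoids overflow.
    return math.log(sum(math.exp(s - m) for s in scaled))

def margin(scores):
    """Gap between the best and second-best score (Delta in Lemma 10)."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]

# Hypothetical scores; not taken from the paper.
scores = [2.0, 0.5, -1.0]
K = len(scores)
delta = margin(scores)
gaps = {alpha: normalizer_gap(scores, alpha)
        for alpha in (0.5, 1.0, 2.0, 4.0, 8.0)}
```

For every `alpha` in the dictionary the gap lies between 0 and `K * math.exp(-delta * alpha)`, so the normalizer and the max score become interchangeable as the scale grows.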


If we let

\[ V_E^*(\eta) = \inf_{\eta' \sim \eta} \mathrm{Var}_{p(x)}[E_\infty(\eta')(X)], \]

we obtain the following.

Corollary 11. For a > log £ K , we have 

\[ V^*(\alpha\eta) \ge V_E^*(\eta)\alpha^2 - (1 + V_E^*(\eta))\alpha. \]

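Proof. For this, observe first that if $\eta' \sim \eta$, then the centered quantities satisfy

\[ \bar A(\alpha\eta', x)^2 \ge \bar E_\infty(\alpha\eta')(x)^2 - 2\,|\bar E_\infty(\alpha\eta')(x)|\,|\bar A(\alpha\eta', x) - \bar E_\infty(\alpha\eta')(x)|. \]

By positive homogeneity of $E_\infty(\eta)$ in its $\eta$ argument, and by Lemma 10, we therefore deduce

\[ \bar A(\alpha\eta', x)^2 \ge \bar E_\infty(\eta')(x)^2\alpha^2 - 2Ke^{-\Delta\alpha}\,|\bar E_\infty(\eta')(x)|\,\alpha. \]

Then using the inequality $\mathbb{E}_{p(x)}[|f(X)|] \le 1 + \mathrm{Var}_{p(x)}[f(X)]$, valid for any $f \in L^2(Q, \mathbb{R}^D)$ with $\mathbb{E}_{p(x)}[f] = 0$, we thus deduce

\[ \mathrm{Var}_{p(x)}[A(\alpha\eta', X)] \ge \mathrm{Var}_{p(x)}[E_\infty(\eta')(X)]\,\alpha^2 - 2Ke^{-\Delta\alpha}\,(1 + \mathrm{Var}_{p(x)}[E_\infty(\eta')(X)])\,\alpha. \]

Taking the infimum over both sides, we get

\[ V^*(\alpha\eta) \ge V_E^*(\eta)\alpha^2 - 2Ke^{-\Delta\alpha}\,(1 + V_E^*(\eta))\,\alpha. \]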

□ 
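The first inequality in this proof is an elementary fact about real numbers. As a worked step, writing $a$ for the centered normalizer and $b$ for the centered max score at a fixed $x$ (the names $a$ and $b$ are shorthand introduced here, unrelated to the $a$ and $b(x)$ of the explicit example):

```latex
\begin{aligned}
a^2 &= \bigl(b + (a - b)\bigr)^2 \\
    &= b^2 + 2b(a - b) + (a - b)^2 \\
    &\ge b^2 - 2\,|b|\,|a - b| .
\end{aligned}
```

The last line drops the nonnegative term $(a - b)^2$ and bounds the cross term from below by its negative absolute value; Lemma 10 then controls $|a - b|$ by $Ke^{-\Delta\alpha}$.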


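We are now prepared to give the explicit example. It is defined by $\eta_k = 0$ if $k > 2$ and

\[ \eta_{1j} = \begin{cases} -a & \text{if } j = 1, \\ \dfrac{a}{d - 1} & \text{o.w.,} \end{cases} \]

and for all $j$,

\[ \eta_{2j} = \frac{a}{d(d - 1)}, \]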


where 


For convenience, also define

\[ b(x) = \sum_j x_j \]


(15) 

(16) 


Aa 
and observe that

\[ E_\infty(\eta)(x) = \begin{cases} \dfrac{a\,b(x)}{d(d - 1)} & \text{if } x_1 = 1, \\[2ex] \dfrac{a\,b(x)}{d - 1} & \text{o.w.} \end{cases} \]


Our goal will be to prove that

\[ 1 \ge V_E^*(\eta) \ge \frac{1}{32d(d - 1)}. \]

The claim will then follow by the above corollary.
To see that $V_E^*(\eta) \le 1$, we simply note that

\[ \max_x |\eta^\top x| \le a \le 1, \]

whence $\mathrm{Var}_{p(x)}[\eta^\top x] \le 1$ as well and we are done.

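The other direction requires more work. To prove it, we first prove the following lemma.

Lemma 12. With $\eta$ defined as in (15)-(16), we have

\[ \inf_{\eta' \sim \eta} \mathbb{E}_{p(x)}[E_\infty(\eta')(X)^2] \ge \frac{1}{16d(d - 1)}. \]

Proof. Suppose $\eta_k - \eta'_k = \beta \in \mathbb{R}^d$. We can then write

\[ \inf_{\eta' \sim \eta} \mathbb{E}_{p(x)}[E_\infty(\eta')(X)^2] = \inf_{\beta \in \mathbb{R}^d} \frac{1}{2^d} \sum_{x \in \mathcal{H}} [E_\infty(\eta)(x) - \beta^\top x]^2 \]

and we therefore define

\[ C(\beta) = \sum_{x \in \mathcal{H}} [E_\infty(\eta)(x) - \beta^\top x]^2 = \sum_{x : x_1 = 0} \left\{ \left[ \beta_1 + \beta^\top x - \frac{a\,(b(x) + 1)}{d(d - 1)} \right]^2 + \left[ \frac{a\,b(x)}{d - 1} - \beta^\top x \right]^2 \right\}, \]

noting that

\[ \inf_\beta C = 2^d \cdot \inf_{\eta'} \mathbb{E}_{p(x)}[E_\infty(\eta')(X)^2]. \]

We therefore need to prove

\[ C \ge \frac{2^{d-4}}{d(d - 1)}. \]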

Holding / 32 :d fixed, we note that the optimal setting of /3i is given by 

a = -*E< 


We can therefore work with the objective 

m = E 

where we have defined 


x: xi —0 L 


2^ >J did- 1 )' 

j>2 y ’ 


(/3 T x-/3 T x^) f ab{x) T 

1 I 7 p X 


x i = , 1 

J 1 1 — Xa o . w . 


d- 1 
if j = 1 , 


Grouping into {x, } pairs, we end up with 

(fd T x — fi T x^Y C ab(x ) 


Aft :d) = E 

x: X 1 —X 2 —O 


d- 1 


— B t x 


ab(x^) 
d — 1 


— fy x~ 
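Now, supposing $b(x) \le \frac{d-1}{2} - \frac{3}{2}$ or $b(x) \ge \frac{d-1}{2} + \frac{3}{2}$, we have

\[ |b(x^\neg) - b(x)| = |d - 1 - 2b(x)| \ge 3. \]

We will bound the terms that satisfy this property. Indeed, supposing we fix such an $x$, at least one of the following must be true: either

\[ \max\left\{ \left[ \frac{a\,b(x)}{d - 1} - \beta^\top x \right]^2, \left[ \frac{a\,b(x^\neg)}{d - 1} - \beta^\top x^\neg \right]^2 \right\} \ge \frac{a^2}{(d - 1)^2} \]

or

\[ (\beta^\top x - \beta^\top x^\neg)^2 \ge \frac{a^2}{(d - 1)^2}. \]

Indeed, suppose the first condition does not hold. Then necessarily

\[ \left| \frac{a\,b(x)}{d - 1} - \beta^\top x \right| < \frac{a}{d - 1} \]

and

\[ \left| \frac{a\,b(x^\neg)}{d - 1} - \beta^\top x^\neg \right| < \frac{a}{d - 1}, \]

so that

\[ \frac{a\,(b(x) - 1)}{d - 1} < \beta^\top x < \frac{a\,(b(x) + 1)}{d - 1} \]

and

\[ \frac{a\,(b(x^\neg) - 1)}{d - 1} < \beta^\top x^\neg < \frac{a\,(b(x^\neg) + 1)}{d - 1}. \]

Now, if $b(x) \ge b(x^\neg) + 3$, this immediately implies

\[ \beta^\top x - \beta^\top x^\neg > \frac{a\,(b(x) - 1)}{d - 1} - \frac{a\,(b(x^\neg) + 1)}{d - 1} \ge \frac{a\,(3 - 2)}{d - 1} = \frac{a}{d - 1}, \]

and, symmetrically, if $b(x^\neg) \ge b(x) + 3$, we get

\[ \beta^\top x^\neg - \beta^\top x > \frac{a}{d - 1}. \]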


Either way, the second inequality holds, whence the claim. Since there are at least 2“ 1 — 


\/¥+i 


> 


2 “ 1 choices of x satisfying the requirements of our line of reasoning, we get 2" 6 pairs, whence 

\[ C(\beta_{2:d}) \ge \frac{2^{d-4}a^2}{(d - 1)^2} \ge \frac{2^{d-4}}{d(d - 1)}, \]

as claimed.


□ 


We can apply this lemma to derive a variance bound, viz.

Lemma 13. With $\eta$ as in (15)-(16), we have

\[ V_E^*(\eta) \ge \frac{1}{32d(d - 1)}. \]

Proof. For this, observe that, with $\eta'$ being the value corresponding to $\eta_k - \eta'_k = \beta$, we have

\[ V_E^*(\eta) = \inf_\beta \frac{1}{2^d} \sum_{x \in \mathcal{H}} \bar E_\infty(\eta')(x)^2 \ge \inf_\beta \frac{1}{2^d} \sum_{x \in \mathcal{H} : x_1 = 1} \bar E_\infty(\eta')(x)^2. \]

Applying the previous result to the $(d - 1)$-dimensional hypercube on which $x_1 = 1$, we deduce

\[ V_E^*(\eta) \ge \frac{1}{2} \cdot \frac{1}{16(d - 1)(d - 2)} = \frac{1}{32(d - 1)(d - 2)} \ge \frac{1}{32d(d - 1)}. \]


□ 
Proof of the theorem from Lemma 13. Putting everything together, we see first that

\[ V^*(\alpha\eta) \ge V_E^*(\eta)\alpha^2 - 4e^{-\Delta\alpha}\alpha, \]

where A = ■ But then this implies 


\[ V^*(\alpha\eta) \ge \frac{\alpha^2}{32d(d - 1)} - 4e^{-\Delta\alpha}\alpha. \]


On the other hand, $\|\eta\|_2^2 \le 2$, so $\alpha^2 = \|\alpha\eta\|_2^2 / \|\eta\|_2^2 \ge \|\alpha\eta\|_2^2 / 2$, whence


V*(m?) > 


IHII2 

64 d(d — 1) 


4e 



2(d-l) 


HHI2) 


which is the desired result. 


□ 